Reduction of Morpho-Syntactic Features in Statistical Machine Translation of Highly Inflective Language
نویسندگان
چکیده
We address the problem of statistical machine translation from highly inflective language to less inflective one. The characteristics of inflective languages are generally not taken into account by the statistical machine translation system. Existing translation systems often treat different inflected word forms of the same lemma as if they were independent of each other, although some interdependencies exist. On the other hand we know that if we reduce inflected word forms to common lemmas, some information is lost. It would be reasonable to eliminate only the variations in inflected word forms, which are not relevant for translation. Inflectional features of words are defined by morpho-syntactic descriptions (MSD) tags and we want reduce them. To do this the explicit knowledge about both languages (source and target language) is needed. The idea of the paper is to find the information-bearing MSDs in source language by data-driven approach. The task is performed by a global optimization algorithm, named Differential Evolution. The experiments were performed using freely available parallel English–Slovenian corpus SVEZ-IJS, which is lemmatized and annotated with MSD tags. The results show a promising direction toward optimal subset of morpho-syntactic features.
منابع مشابه
Slovenian to English Machine Translation using Corpora of Different Sizes and Morpho-syntactic Information
Word based statistical machine translation has emerged as a robust method for building machine translation systems. Inflective languages point out some problems with the approach. Data sparsity is one of them. It can be partly solved by enlarging the training corpus and/or including richer linguistic information: lemmas and morpho-syntactic features. Acquisition of a large bilingual parallel co...
متن کاملIntegrating morpho-syntactic features in English-Arabic statistical machine translation
This paper presents a hybrid approach to the enhancement of English to Arabic statistical machine translation quality. Machine Translation has been defined as the process that utilizes computer software to translate text from one natural language to another. Arabic, as a morphologically rich language, is a highly flexional language, in that the same root can lead to various forms according to i...
متن کاملMorphology In Statistical Machine Translation From English To Highly Inflectional Language
In this paper, we investigate the role of morphology in phrase-based statistical machine translation (SMT) from English to the highly inflectional Slovenian language. Translation to an inflectional language is a challenging task because of its morphological complexity. Rich morphology increases data sparsity and worsens the quality of statistical machine translation. The idea of the paper is to...
متن کاملStatistical Machine Translation of Parliamentary Proceedings Using Morpho-Syntactic Knowledge
This paper presents an overview of the University of Washington statistical machine translation system developed for the 2006 TCSTAR evaluation campaign. We use a statistical phrase-based system with multiple decoding passes and a log-linear probability model. Our main focus was on exploring the possibility of using morpho-syntactic knowledge (lemmas and part-of-speech tags) for word alignment,...
متن کاملRich morpho-syntactic descriptors for factored machine translation with highly inflected languages as target
The baseline phrase-based translation approach has limited success on translating between languages with very different syntax and morphology, especially when the translation direction is from a language with fixed word structure to a highly inflected language. There are two main points to improve on: morphological translation equivalence and long range reordering. Translating the correct surfa...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Informatica, Lith. Acad. Sci.
دوره 21 شماره
صفحات -
تاریخ انتشار 2010